model: support Longcat-Flash (help wanted) #19182
ngxson wants to merge 5 commits into ggml-org:master from …
Conversation
Huh, interesting. Likely need to extend `ggml_mul_mat_id`.
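For reference, the existing interface looks roughly like this (abridged from `ggml.h`); every entry of `ids` has to index a real expert matrix in `as`, so there is currently no way to express a "do nothing" expert:

```cpp
// ggml.h (roughly): indirect matrix multiplication over a set of expert matrices.
//   as  -- 3D tensor holding all expert weight matrices, one per expert
//   b   -- input activations
//   ids -- per-token indices of the selected experts
// Every id selects a real matrix in `as`; there is no "identity"/skip slot.
GGML_API struct ggml_tensor * ggml_mul_mat_id(
        struct ggml_context * ctx,
        struct ggml_tensor  * as,
        struct ggml_tensor  * b,
        struct ggml_tensor  * ids);
```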
Hello, I'm also paying attention to the adaptation of the longcat-flash model. Suppose we bypass the requirement of …
@hebangwen Not sure I follow, but I think simply setting the coefficients like this should work:

```
# normal expert
alpha_i = 1.0f
beta_i  = 0.0f

# zero-compute expert
alpha_i = 0.0f
beta_i  = 1.0f
```

And the matrix C is just the MoE input (i.e. …)
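For concreteness, here is a minimal sketch (not from this PR) of how such per-slot coefficients could be applied in the graph, assuming hypothetical tensors `alpha` and `beta` of shape `[1, n_expert_used, n_tokens]` built from the routing result, and `experts` being the usual output of the expert matmuls:

```cpp
#include "ggml.h"

// Sketch only: per selected slot, out = alpha * FFN_expert(x) + beta * x.
// `experts` is the usual [n_embd, n_expert_used, n_tokens] output of the
// up/gate/down ggml_mul_mat_id chain; `alpha`/`beta` are hypothetical
// [1, n_expert_used, n_tokens] tensors built from the routing result:
//   normal expert slot       -> alpha = 1, beta = 0
//   zero-compute expert slot -> alpha = 0, beta = 1
// (router weights not shown here)
static ggml_tensor * apply_zero_expert_coeffs(
        ggml_context * ctx0,
        ggml_tensor  * experts, // [n_embd, n_expert_used, n_tokens]
        ggml_tensor  * x,       // [n_embd, n_tokens] MoE input
        ggml_tensor  * alpha,   // [1, n_expert_used, n_tokens]
        ggml_tensor  * beta) {  // [1, n_expert_used, n_tokens]
    const int64_t n_embd   = x->ne[0];
    const int64_t n_tokens = x->ne[1];

    experts = ggml_mul(ctx0, experts, alpha);          // weight the FFN path per slot

    // identity path: broadcast x over the expert-slot dimension
    ggml_tensor * x3d  = ggml_reshape_3d(ctx0, x, n_embd, 1, n_tokens);
    ggml_tensor * skip = ggml_mul(ctx0, ggml_repeat(ctx0, x3d, experts), beta);

    return ggml_add(ctx0, experts, skip);              // reduce over slots as usual afterwards
}
```

Note that this only fixes the math: the zero-compute slots still run through `ggml_mul_mat_id`, so none of the computation that the paper is trying to skip is actually saved.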
It can be simpler to explain the …
As @ggerganov suggested, I imagine the …

In the example above, computation for …

However, one issue is that the router weight …

However, yet another problem, even when the idea above is implemented: the output dim of …

For the calculation of the …
Seems like quite a bit more work than I initially thought, so I think we should reconsider whether this is worth implementing. Currently, only the longcat-flash family uses this technique, so it can be quite risky to add this much infrastructure to support it.
Yes, seems more complicated. Let's reconsider later in case this architecture shows any promise. |
I was working on #19167 but realized that the normal (non-ngram) model is not even supported yet.
Thinking it would be simple, I gave it a try, but ended up stuck implementing their notion of "zero-computing experts" (ref: link to paper).
The main problem is that `ggml_mul_mat_id` isn't made for this purpose, and I have no idea how to adapt it, or which ops may need to be added to make it work.

To illustrate the problem, I will take an example of how a normal MoE FFN works: the router picks `n_expert_used` experts per token, and the selected expert matmuls are applied with `ggml_mul_mat_id` (see the sketch at the end of this description). This means we spend the same amount of computation for each token, proportional to `n_expert_used`.

However, with longcat-flash: only tokens routed to a real expert (i.e. not one of the `n_zero_experts`) will go through the FFN; for the rest, they skip the FFN altogether.

Apart from the weird MoE, the model has a double-block architecture, meaning there are 2 attentions and 2 FFNs per layer. Upon converting to GGUF, we convert it to a model of `2 * n_layer` layers, which makes the implementation much easier.
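For readers less familiar with the existing code, below is a condensed sketch of the standard MoE FFN graph, loosely following llama.cpp's `build_moe_ffn`; tensor names such as `gate_inp`, `up_exps`, `gate_exps`, `down_exps` stand for the usual per-layer expert weights, and details like router-weight normalization are omitted:

```cpp
#include "ggml.h"

// Condensed sketch of the standard MoE FFN graph. Every token pays for exactly
// n_expert_used expert matmuls, regardless of which experts were selected.
static ggml_tensor * moe_ffn_sketch(
        ggml_context * ctx0,
        ggml_tensor  * cur,        // [n_embd, n_tokens] MoE input
        ggml_tensor  * gate_inp,   // [n_embd, n_expert] router weights
        ggml_tensor  * up_exps,    // [n_embd, n_ff, n_expert]
        ggml_tensor  * gate_exps,  // [n_embd, n_ff, n_expert]
        ggml_tensor  * down_exps,  // [n_ff, n_embd, n_expert]
        int n_expert, int n_expert_used) {
    const int64_t n_embd   = cur->ne[0];
    const int64_t n_tokens = cur->ne[1];

    ggml_tensor * logits  = ggml_mul_mat(ctx0, gate_inp, cur);          // [n_expert, n_tokens]
    ggml_tensor * probs   = ggml_soft_max(ctx0, logits);
    ggml_tensor * sel     = ggml_top_k(ctx0, probs, n_expert_used);     // [n_expert_used, n_tokens]
    ggml_tensor * weights = ggml_get_rows(ctx0,
            ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), sel);  // [1, n_expert_used, n_tokens]

    cur = ggml_reshape_3d(ctx0, cur, n_embd, 1, n_tokens);

    ggml_tensor * up   = ggml_mul_mat_id(ctx0, up_exps,   cur, sel);    // [n_ff, n_expert_used, n_tokens]
    ggml_tensor * gate = ggml_mul_mat_id(ctx0, gate_exps, cur, sel);
    gate = ggml_silu(ctx0, gate);

    ggml_tensor * experts = ggml_mul_mat_id(ctx0, down_exps,
            ggml_mul(ctx0, up, gate), sel);                             // [n_embd, n_expert_used, n_tokens]
    experts = ggml_mul(ctx0, experts, weights);

    // the n_expert_used slots are then summed into [n_embd, n_tokens]
    return experts;
}
```

With longcat-flash, some of the ids chosen by the router refer to zero-compute experts; those slots should cost nothing, but the graph above has no way to make a `ggml_mul_mat_id` slot free, which is the sticking point discussed in the conversation above.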